
Lance/nnx all #3425

Draft

ecnal-cienet wants to merge 16 commits into main from lance/nnx_all

Conversation

@ecnal-cienet (Collaborator)

ecnal-cienet commented Mar 16, 2026

Description

Start with a short description of what the PR does and how this is a change from
the past.

The rest of the description includes relevant details and context, examples:

  • why is this change being made,
  • the problem being solved and any relevant context,
  • why this is a good solution,
  • some information about the specific implementation,
  • shortcomings of the solution and possible future improvements.

If the change fixes a bug or a GitHub issue, please include a link, e.g.:
FIXES: b/123456
FIXES: #123456

Notice 1: Once all tests pass, the "pull ready" label will automatically be assigned.
This label is used for administrative purposes. Please do not add it manually.

Notice 2: For external contributions, our settings currently require an approval from a MaxText maintainer to trigger CI tests.

Tests

Please describe how you tested this change, and include any instructions and/or
commands to reproduce.

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

ecnal-cienet force-pushed the lance/nnx_all branch 2 times, most recently from 08d190d to 3ed2aad (March 16, 2026 21:24)
Charles Li and others added 16 commits March 19, 2026 14:56
- pure_nnx: a flag to choose pure NNX logic when NNX and Linen models
  co-exist.
- init_state_fn: a function to initialize the model state for
  training. It will be set to a different function for NNX and Linen.
- Add utils to manipulate the NNX shardings with abstract state of a
  model
  - also add unit tests for the utils
- Extract mesh creation function to maxtext_utils.get_mesh_from_config()
  - also add unit tests for this func
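The flag-based dispatch described above can be sketched as follows. This is a minimal illustration, not the actual MaxText API: the function names and the state layouts are hypothetical stand-ins.

```python
# Hypothetical sketch of selecting init_state_fn via the pure_nnx flag
# when NNX and Linen models co-exist. All names and state layouts here
# are illustrative, not the real MaxText implementation.

def init_state_linen(rng_seed):
    """Placeholder Linen-style init: returns a params-only state dict."""
    return {"params": {"dense": {"kernel": [0.0] * 4}}, "framework": "linen"}

def init_state_nnx(rng_seed):
    """Placeholder NNX-style init: returns a model/optimizer state dict."""
    return {"model": {"dense": {"kernel": [0.0] * 4}}, "framework": "nnx"}

def select_init_state_fn(pure_nnx: bool):
    # The flag picks which initializer the training loop will call.
    return init_state_nnx if pure_nnx else init_state_linen

state = select_init_state_fn(pure_nnx=True)(rng_seed=0)
print(state["framework"])  # -> nnx
```

Keeping the selection in one place lets the rest of the training loop stay agnostic to which framework produced the state.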

Note:
flax v0.12 has DeprecationWarning in multiple places:
  - DeprecationWarning: '.value' access is now deprecated. Use
    variable.get_value() or variable[...] (for [Array]).
  - DeprecationWarning: 'VariableState' was removed, this is just
    an alias to 'Variable'. Please use 'Variable' directly instead.
But since the code needs to work with post-training, which currently
requires flax v0.11, we didn't change code for these warnings.
A TrainState for NNX, which includes model and optimizer
Unit tests include checkpoint tests:
- restore a saved state
- convert linen TrainState to NNX TrainState
- Parameter only restore (no opt_state)
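The parameter-only restore case from the tests above can be illustrated with a small sketch. The train-state layout shown here is an assumption for illustration only, not the real checkpoint schema:

```python
# Illustrative parameter-only restore: drop opt_state from a saved
# train-state dict before loading. The keys ("params", "opt_state",
# "step") are hypothetical stand-ins for the real checkpoint layout.
def params_only(saved_state: dict) -> dict:
    return {k: v for k, v in saved_state.items() if k != "opt_state"}

saved = {"params": {"w": [1, 2]}, "opt_state": {"mu": [0, 0]}, "step": 7}
restored = params_only(saved)
print(sorted(restored))  # -> ['params', 'step']
```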
…ion_utils

Also added unit tests.
Refactored model_creation_utils to provide common create_nnx_abstract_model() func.

b/src/maxtext/utils/model_creation_utils.py
1. A new func get_abstract_state_nnx() is added to maxtext_utils.py.
   It will be called during training to create the NNX training state.

Like the Linen version, it handles shard_optimizer_over_data,
optimizer_memory_host_offload, and parameter_memory_host_offload.

Unit tests are added to this NNX func.

2. Add nnx train_state handling in train_utils.py

DPO handling will be supported (or removed) later in train_utils.py
Also added unit tests for NNX model.
- get_functional_train_with_signature: use (state, batch) shardings when pure_nnx=True
- get_functional_eval_with_signature: use (state, batch) shardings when pure_nnx=True
- Convert nnx.State to pure dict for checkpoint saving
- Restore pure dict back to nnx.State after loading
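The save/restore round-trip above can be sketched as follows. A real implementation would walk nnx.State and nnx.Variable objects; here plain Python classes stand in, so the structure (but not the API) is illustrative:

```python
# Minimal sketch (not the MaxText implementation) of round-tripping a
# nested state tree to a plain dict for checkpoint saving and back.
# Var stands in for a variable wrapper such as nnx.Variable.

class Var:
    def __init__(self, value):
        self.value = value

def to_pure_dict(tree):
    """Unwrap every leaf so the tree is checkpoint-friendly plain data."""
    if isinstance(tree, Var):
        return tree.value
    return {k: to_pure_dict(v) for k, v in tree.items()}

def from_pure_dict(tree):
    """Re-wrap every leaf after loading the plain dict from disk."""
    if isinstance(tree, dict):
        return {k: from_pure_dict(v) for k, v in tree.items()}
    return Var(tree)

state = {"layer": {"kernel": Var([1.0, 2.0]), "bias": Var([0.0])}}
pure = to_pure_dict(state)
print(pure)  # -> {'layer': {'kernel': [1.0, 2.0], 'bias': [0.0]}}
```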
Add a bidirectional Linen <-> NNX checkpoint converter tool that handles:
  - Auto-detection of checkpoint format
  - Conversion of params structure (double nesting vs flat)
  - Stacking/unstacking per-layer parameters
  - Value wrapper handling for NNX format
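The format auto-detection bullet above could work roughly like this. The "double nesting vs flat" layouts are taken from the commit message, but the exact key structure shown is an assumption:

```python
# Hypothetical auto-detection of checkpoint format by params nesting:
# a double-nested {"params": {"params": ...}} layout is treated as
# Linen, a flat {"params": ...} layout as NNX. The layouts are
# illustrative assumptions, not the verified on-disk schema.
def detect_format(ckpt: dict) -> str:
    params = ckpt.get("params", {})
    return "linen" if "params" in params else "nnx"

print(detect_format({"params": {"params": {"w": 1}}}))  # -> linen
print(detect_format({"params": {"w": 1}}))              # -> nnx
```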
  Add a tool to compare checkpoint tree structures, shapes, and values
  across Linen and NNX formats. Supports cross-format and same-format
  comparisons with auto-detection, layer axis transposition, and RNG
  filtering.
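The stacking/unstacking of per-layer parameters mentioned above can be sketched with plain lists standing in for arrays. The "layers_N" naming is an illustrative assumption:

```python
# Sketch of stacking/unstacking per-layer parameters, as a
# Linen <-> NNX converter might do when one format stores layers
# individually ("layers_0", "layers_1", ...) and the other as one
# stacked sequence per parameter. Names and layout are hypothetical.
def stack_layers(per_layer: dict) -> list:
    # Order by the numeric suffix so "layers_10" sorts after "layers_2".
    keys = sorted(per_layer, key=lambda k: int(k.rsplit("_", 1)[1]))
    return [per_layer[k] for k in keys]

def unstack_layers(stacked: list, prefix: str = "layers") -> dict:
    return {f"{prefix}_{i}": v for i, v in enumerate(stacked)}

per_layer = {"layers_0": [1.0], "layers_1": [2.0]}
stacked = stack_layers(per_layer)
print(stacked)                  # -> [[1.0], [2.0]]
print(unstack_layers(stacked))  # -> {'layers_0': [1.0], 'layers_1': [2.0]}
```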
- Add --pure_nnx CLI flag to run_sharding_dump.py
- Propagate pure_nnx=true to the sharding_dump subprocess when flag is set
- Refactor run_single_dump() to build the command as a list for conditional flag appending
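Building the command as a list, as the refactor above describes, makes conditional flag appending straightforward. The script name and flag value come from this PR's text, but the exact helper shown is a hypothetical sketch:

```python
# Minimal sketch of building a subprocess command as a list so flags
# can be appended conditionally, in the spirit of the run_single_dump()
# refactor. The helper name and argument format are assumptions.
def build_dump_command(config_path: str, pure_nnx: bool) -> list:
    cmd = ["python", "run_sharding_dump.py", config_path]
    if pure_nnx:
        # Propagate the flag to the sharding_dump subprocess only when set.
        cmd.append("pure_nnx=true")
    return cmd

print(build_dump_command("base.yml", pure_nnx=True))
# -> ['python', 'run_sharding_dump.py', 'base.yml', 'pure_nnx=true']
```

A list (rather than a concatenated shell string) also avoids quoting issues when it is eventually passed to subprocess.run.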
- Replace nn.Dropout with linears.Dropout in gpt_oss and olmo3 decoder layers
- Add num_activations logical axis rule to base.yml
- Fix integration and unit tests for NNX compatibility

I will relocate these files accordingly once the work is done.
@github-actions

This PR has been automatically marked as stale because it has not had recent activity. It will be closed soon if no further activity occurs. Thank you for your contributions.

@github-actions github-actions Bot added the stale Automatically applied to stale PRs. label Apr 19, 2026

2 participants